1. Introduction

Inspecting hematopoietic development in the mouse embryo through a dataset of 3,934 cells “with blood-forming potential captured at four time points between E7.0 and E8.5” and their expression levels for 46 different genes, Moignard et al. used unsupervised analysis techniques to reconstruct the “transcriptional programs that underpin organogenesis” (Moignard et al.). In particular, Moignard et al. applied hierarchical (agglomerative) clustering; dimensionality reduction in the form of multidimensional scaling (MDS), diffusion mapping, and principal components analysis (PCA); and network synthesis of Boolean update rules to analyze the molecular pathways behind development. We further their analysis by applying additional unsupervised techniques, including k-means clustering, nonlinear (kernel) PCA, locally linear embedding (LLE), and association rules, comparing each against a recreated PCA analysis to further confirm their conclusions.

2. Analysis Techniques and Challenges

Despite the presence of a response variable in the form of cell classification, we must use unsupervised learning techniques with the Moignard et al. gene expression data to infer the shape and properties of the probability density of gene expression (ESL 502). That density corresponds to the molecular pathways of hematopoietic cellular development over time. Because the gene expression data is high-dimensional, the curse of dimensionality implies that traditional attempts to estimate the density of gene expression will fail, necessitating dimensionality reduction and other techniques (ESL 502).

Dimensionality reduction techniques span from linear methods such as PCA to more recent nonlinear methods such as LLE. Nonlinear methods can be broadly classified into those that preserve global properties such as distance (e.g. the MDS and diffusion map methods used by Moignard et al.) or use nonlinear kernels to relate data points (e.g. kernel PCA) and those that preserve local properties such as LLE (van der Maaten et al.). While nonlinear dimensionality reduction methods perform well on artificial datasets, linear PCA has traditionally outperformed nonlinear methods on natural datasets, as the curse of dimensionality inhibits the performance of local nonlinear methods and algorithms face difficulty in finding extraordinarily small eigenvalues when solving the eigenproblem (van der Maaten et al.).

However, such comparisons have not yet been made for gene expression data. Moreover, while the gene expression data is high-dimensional, the number of samples is significantly larger than the number of predictors. The number of predictors is also less than fifty, and the Moignard et al. analysis suggests the existence of a three-dimensional manifold embedded in this higher-dimensional space which captures most of the dataset’s original variance (Moignard et al.). Thus, we recreate PCA analysis on the normalized gene expression dataset and compare it to various kernel PCAs and LLE.

Clustering techniques model the probability space as a mixture of multiple densities that represent distinct classes of data (ESL 502). Moignard et al. performed agglomerative (hierarchical) clustering on both the cell samples and the gene factors, finding three broad clusters. Cluster I was small, consisting primarily of PS and NP cells, with no expression of erythroid genes and low expression of endothelial genes, but high Cdh1. Cluster II was the largest, consisting of all non-4SG cells, with subclusters of endothelial and hemogenic endothelial cells. Cluster III was composed almost entirely of 4SG cells, which express hematopoietic genes and low to no levels of endothelial genes (Sox, Hox, and Erg).

To supplement their analysis, we use k-means clustering, which randomly partitions the data set into k clusters and then iterates, alternately recomputing the centroid of each cluster and reassigning each sample to its nearest centroid, until the assignments stabilize. Using the within-cluster sum of squares (WSS) metric and the GAP statistic, we can find the optimal number of clusters into which the gene expression data partitions.

Association rules analysis builds “conjunctive rules” from high-dimensional binary data in order to describe “regions of high density” (ESL 486). Algorithms that find association rules have been successfully applied to gene expression data in a supervised setting that seeks to profile phenotypes (Chen et al.). In fact, Moignard et al. create thresholds for binary discretization of their gene expression data for network synthesis constructed from a Boolean satisfiability problem (Moignard et al.). Their successful analysis suggests that applying association rules to the discretized binary data might find sets of “on” genes that cluster together and tend to imply other “on” genes. Unfortunately, association rules analysis on even small datasets such as the given binary dataset produces hundreds of thousands to tens of millions of rules. To mitigate this issue, we focus on a narrow problem: demonstrating the roles that Sox and Hox factors play in activating Erg, which controls HSC development.

3. Results

The original PCA analysis demonstrated the existence of a three-dimensional manifold that captures most of the variance of the original high-dimensional data. Movement across the manifold represents development over time, where movement towards and clustering upon a certain portion of the manifold represents differentiation. In particular, the original PCA analysis demonstrated that the 4SG cells were highly separable from the rest of the cells, whereas the rest of the cells were more gradually and less tightly clustered across the manifold.

The LLE analysis merely confirmed the original PCA analysis, showing the existence of a time gradient for gene expression without adding new insight. Moreover, because LLE created an embedding that collapsed distinct high-dimensional data points onto the same low-dimensional points, the embedding lacked granularity. On the other hand, kernel PCAs provided additional granular views of the same high-dimensional gene expression data on a three-dimensional manifold. Each kernel PCA explained marginally more variance than the original PCA analysis but consumed more computational time and storage. Overall, this indicates that for certain forms of gene expression data involving developing and differentiating cells, vanilla PCA may provide the simplest and least intensive dimensionality reduction analysis of the original data.

The clustering analysis using the optimal number of clusters for the k-means algorithm demonstrated the correctness of the original hierarchical clustering, in which Moignard et al. claimed the existence of three overarching clusters, with some subclusters in the second and largest cluster. Inspection of the resulting centroids as representatives of their respective clusters showed additional gradients for endothelial and hematopoietic development.

The association rules analysis demonstrated the viability of linking HoxB4 and Sox17 to the activation of Erg for HSC development. Moreover, the analysis confirmed already-known linkages, such as complexing between Gata1 and Gfi1b. However, the analysis produced tens of millions of rules with large lift, even when requiring high levels of support and confidence. Without sophisticated computing resources, such association rules analysis may not be manageable when the data have more than ten or so features. Still, the use of subsetting to find sets of “on” genes that imply a certain gene, or to find the implications of a certain gene being “on,” was more effective in demonstrating connections between transcription factors. Future research into association rules analysis, particularly in high-dimensional contexts, must be done before it can be used as a primary tool for unsupervised analysis.

4. Conclusion

Moignard et al.’s original salient conclusions appear to be correct, including the activation of the endothelial gene Erg by Sox17 and HoxB4 factors and the broad clustering of cells that mixes both cell type and types of genes expressed (endothelial vs. hematopoietic). Moreover, Moignard et al. utilized the optimal dimensionality reduction and rules-based analysis available in this unsupervised learning context, as the overall added complexity of nonlinear dimensionality reductions provides only marginal benefit, and association rules analysis currently lacks the tools and concepts necessary to make sense of implications in high-dimensional datasets. At the same time, the k-means clustering analysis provided a less complicated and more manageable approach to partitioning the cells into clusters, with clear implications drawn from the expression levels of the centroids.

5. Supplementary Information

i. Setup

Let’s first take a look at the data provided by Moignard et al. in their paper “Decoding the regulatory network of early blood development from single-cell gene expression measurements”. To get a rough understanding of the values included, we’ll first create boxplots of both the raw and the normalized data.

## cells by sort type:
##  sortA sortB
##   3175   759
ggplot(stack(Raw), aes(x = ind, y = values)) + geom_boxplot(aes(fill = ind))

ggplot(stack(Nor), aes(x = ind, y = values)) + geom_boxplot(aes(fill = ind))

Although the number of samples (cells studied) is greater than the number of predictors or features (transcription factors/genes studied), both the raw and normalized data are still too high-dimensional to be analyzed sensibly in an unsupervised environment without some forms of dimensionality reduction.

dim(Raw)
## [1] 3934   46
dim(Nor)
## [1] 3934   46

From here, we will only be looking at the normalized data.

ii. Principal Components Analysis

Our staple dimensionality reduction tool for continuous data is principal components analysis (PCA), which we perform first on the expression data. In short, PCA identifies new predictors or features (“principal components”) that are linear combinations of the original predictors or features (transcription factors/genes studied) such that each new principal component explains as much of the remaining variance in the data as possible while remaining orthogonal (perpendicular) to the previous principal components. Simply put, the original dataset shows data points scattered throughout some high-dimensional space, and PCA identifies lower-dimensional subspaces (implying linearity) that show the same data points with as much of the original scatter as possible.
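As a quick sanity check on this description, a minimal sketch in base R (toy data, not the expression dataset) confirms that the principal component loadings found by prcomp are exactly the eigenvectors of the sample covariance matrix:

```r
# Sketch (base R, toy data): PCA loadings = eigenvectors of the covariance matrix.
set.seed(1)
X <- matrix(rnorm(200), ncol = 4)   # 50 toy "cells" x 4 toy "genes"
pr <- prcomp(X, scale = FALSE)
ev <- eigen(cov(X))$vectors         # eigenvectors of the sample covariance
# Loadings agree with the covariance eigenvectors up to sign:
all.equal(abs(unname(pr$rotation)), abs(ev))
```

The eigenvalues of the covariance matrix, in turn, are the variances explained by each component, which is why the PVE plots below follow directly from the PCA output.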

We perform PCA on the data and plot the first two and the first three principal components:

pr.nor <- prcomp(Nor, scale = FALSE)
plot(pr.nor$x[,1:2], col = cellcol, pch = 20, xlab = "PC1", ylab = "PC2")

scatterplot3d(pr.nor$x[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = 30)

How representative of the original dataset are our new principal components? We can take a look at the proportion of variance explained (PVE) by the principal components:

pve <- 100 * pr.nor$sdev^2/sum(pr.nor$sdev^2)
plot(pve, type = "o", ylab = "PVE", xlab = "Principal Component", col = "blue")

plot(cumsum(pve), type = "o", ylab = "Cumulative PVE", xlab = "Principal Component", col = "brown3")

In particular, the first three principal components have the following respective percentages of variance explained:

pve[1]
## [1] 33.79748
pve[2]
## [1] 23.55951
pve[3]
## [1] 4.829649

Clearly, based on the above graphs, the first two to three principal components provide the most efficient lower-dimensional projection of our high-dimensional dataset.

In this case, when we performed PCA on the dataset, we did not scale our expression levels for each gene. After all, expression levels for each gene are already in the same units, and relative amounts of expression qualitatively matter when considering the molecular pathways of blood cell development. Let’s validate this claim by performing PCA on the dataset after scaling each column of the dataset (expression values for each particular gene/transcription factor).
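Before doing so, it may help to see what scaling does mechanically. A minimal sketch in base R (toy data) shows that prcomp with scale = TRUE is equivalent to running unscaled PCA on columns standardized to unit variance:

```r
# Sketch (base R, toy data): scaled PCA = unscaled PCA on standardized columns.
set.seed(2)
X <- matrix(rnorm(300), ncol = 3)
p1 <- prcomp(X, scale = TRUE)         # PCA with internal scaling
p2 <- prcomp(scale(X), scale = FALSE) # PCA on pre-standardized data
all.equal(p1$sdev, p2$sdev)           # identical component standard deviations
all.equal(abs(p1$rotation), abs(p2$rotation))  # identical loadings up to sign
```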

pr.norsc <- prcomp(Nor, scale = TRUE)
summary(pr.norsc)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.4332 2.8843 1.51999 1.32798 1.27183 1.03645
## Proportion of Variance 0.2562 0.1809 0.05023 0.03834 0.03516 0.02335
## Cumulative Proportion  0.2562 0.4371 0.48732 0.52565 0.56082 0.58417
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.02351 1.00810 0.99139 0.98162 0.95537 0.93766
## Proportion of Variance 0.02277 0.02209 0.02137 0.02095 0.01984 0.01911
## Cumulative Proportion  0.60694 0.62904 0.65040 0.67135 0.69119 0.71031
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.90173 0.86982 0.84576 0.83099 0.81490 0.80272
## Proportion of Variance 0.01768 0.01645 0.01555 0.01501 0.01444 0.01401
## Cumulative Proportion  0.72798 0.74443 0.75998 0.77499 0.78943 0.80344
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.78703 0.76955 0.74952 0.73427 0.69663 0.68136
## Proportion of Variance 0.01347 0.01287 0.01221 0.01172 0.01055 0.01009
## Cumulative Proportion  0.81690 0.82978 0.84199 0.85371 0.86426 0.87435
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.67085 0.66502 0.63076 0.62750 0.61089 0.58028
## Proportion of Variance 0.00978 0.00961 0.00865 0.00856 0.00811 0.00732
## Cumulative Proportion  0.88413 0.89375 0.90240 0.91096 0.91907 0.92639
##                           PC31    PC32    PC33    PC34    PC35    PC36
## Standard deviation     0.57403 0.55970 0.54769 0.53138 0.50888 0.50016
## Proportion of Variance 0.00716 0.00681 0.00652 0.00614 0.00563 0.00544
## Cumulative Proportion  0.93355 0.94036 0.94689 0.95302 0.95865 0.96409
##                           PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.48054 0.46831 0.45562 0.44675 0.43683 0.42257
## Proportion of Variance 0.00502 0.00477 0.00451 0.00434 0.00415 0.00388
## Cumulative Proportion  0.96911 0.97388 0.97839 0.98273 0.98688 0.99076
##                           PC43    PC44    PC45      PC46
## Standard deviation     0.40655 0.36611 0.35454 5.184e-10
## Proportion of Variance 0.00359 0.00291 0.00273 0.000e+00
## Cumulative Proportion  0.99435 0.99727 1.00000 1.000e+00

In this case, the first three principal components still explain more than 45 percent of the original dataset’s variance, implying that even under scaling, PCA can find two to three orthogonal directions through the dataset that explain almost a majority of the variance in the high-dimensional dataset.

Since we know each cell sample’s original cell type, we can actually create a biplot that not only plots each cell sample and each transcription factor by the first two principal components, but also colors and “clusters” each cell sample by cell type:

pp <- ggbiplot(pr.nor, obs.scale = 1, var.scale = 1, scale = 1, groups = celltypes, ellipse = TRUE, circle = TRUE)
pp

The above plot singlehandedly provides extensive orientation regarding gene expression patterns across the five cell types. In particular, the directions of expression for any particular gene match predictions made before and in Moignard et al. For instance, expression of E-cadherin (Cdh1) is restricted to the primitive streak (PS) cells, expression of endothelial genes (such as Sox, Hox, Erg, and Cdh5 factors) is restricted to the non-hematopoietic 4SFG− cells (and likely progenitors in the form of HF and NP cells), and expression of hematopoietic genes (such as Hbb-bH1, Ikaros, Gata1, and Gfi1b) is restricted to 4SG hematopoietic cells. However, it does not provide any additional information regarding which genes activate others during differentiation and over time.

iii. LLE

We first attempt to build an LLE of our gene expression data using ten neighbors. In other words, for each particular high-dimensional sample (cell), the LLE algorithm finds the k closest neighbors and expresses the high-dimensional position of the sample as a weighted linear combination of the positions of those neighbors. Then, the LLE algorithm finds a lower-dimensional space onto which it projects the samples while trying to preserve the weights of those linear combinations.
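The weight-fitting step can be sketched in base R for a single point. This is not the lle package's implementation, just a toy illustration: the weights minimizing the reconstruction error subject to summing to one come from solving a small linear system built on the local Gram matrix of the neighbors.

```r
# Sketch (base R, toy data): LLE reconstruction weights for one point x
# from its neighbors N (rows), minimizing ||x - sum_j w_j N_j|| with sum(w) = 1.
lle_weights <- function(x, N) {
  Z <- sweep(N, 2, x)                          # shift neighbors so x is at the origin
  C <- Z %*% t(Z)                              # local Gram matrix
  C <- C + diag(1e-3 * sum(diag(C)), nrow(C))  # regularize for numerical stability
  w <- solve(C, rep(1, nrow(C)))
  w / sum(w)                                   # enforce the sum-to-one constraint
}
set.seed(3)
N <- matrix(rnorm(15), nrow = 5)  # 5 toy neighbors in 3 dimensions
x <- colMeans(N)                  # a point well reconstructed by its neighbors
w <- lle_weights(x, N)
sum(w)                            # the weights sum to 1
```

LLE then seeks low-dimensional coordinates for all samples that are reconstructed by these same weights.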

Here, we try LLE with the ten closest neighbors. We are interested in finding an LLE in three dimensions, as our PCA plots demonstrated the presence of a manifold in three dimensions.

l <- lle(X = Norm, m = 3, k = 10)
## finding neighbours
## calculating weights
## computing coordinates
scatterplot3d(l$Y, color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = 20)

The plot is not smooth because LLE tends to collapse data points in very high dimensions to single points in the specified lower dimensions (van der Maaten et al.). While the plot does demonstrate a highly linear gradient and preserves the same separation of 4SG cells, the inability to make a granular inspection of the embedding means that the LLE plot merely confirms the existence of the manifold without providing any new insights.

iv. Kernel PCAs

First, we use a polynomial kernel in an attempt to qualitatively improve the amount of variance explained by the principal components. Using the “kernlab” package, we first try PCA using polynomials of degree two and plot the first two principal components:

kp.nor <- kpca(Norm, kernel = "polydot", kpar = list(degree = 2))
plot(pcv(kp.nor)[,1:2], col = cellcol, pch = 20, xlab = "PC1", ylab = "PC2")

Next, we plot the first three principal components:

scatterplot3d(pcv(kp.nor)[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = 30)

scatterplot3d(pcv(kp.nor)[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = -150)

The following is a captured image of an interactive plot made using “plot3d” of the first three principal components:

[Captured image: interactive 3D plot of the first three kernel principal components]

Graphically, the three-dimensional manifold mapped by the first three principal components of a two-degree polynomial PCA appears “tighter” or more well-defined than the one mapped by the first three PCs of a linear PCA. At the same time, the PVE suggests that this PCA does not analytically “improve” the amount of detail that is captured in the lower-dimensional subspace:

pve <- 100 * eig(kp.nor)/sum(eig(kp.nor))
pve[1:3]/sum(pve)
##     Comp.1     Comp.2     Comp.3 
## 0.32651700 0.23380464 0.05321668
par(mfrow = c(1, 2))
plot(pve, type = "o", ylab = "PVE", xlab = "Principal Component", col = "blue")
plot(cumsum(pve), type = "o", ylab = "Cumulative PVE", xlab = "Principal Component", col = "brown3")

Next, we try PCA using polynomials of degree three and plot the first two principal components:

kp.nordeg3 <- kpca(Norm, kernel = "polydot", kpar = list(degree = 3))
plot(pcv(kp.nordeg3)[,1:2], col = cellcol, pch = 20, xlab = "PC1", ylab = "PC2")

Next, we plot the first three principal components:

scatterplot3d(pcv(kp.nordeg3)[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = -250)

scatterplot3d(pcv(kp.nordeg3)[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = 300)

Visually, the displayed manifold is a single, smooth curve that emphasizes the separation of the 4SG cells at the end of this development cycle. Approximately 60% of the original variance is explained by the first three principal components.

pve <- 100 * eig(kp.nordeg3)/sum(eig(kp.nordeg3))
pve[1:3]/sum(pve)
##     Comp.1     Comp.2     Comp.3 
## 0.35284291 0.19460562 0.05832961
par(mfrow = c(1, 2))
plot(pve, type = "o", ylab = "PVE", xlab = "Principal Component", col = "blue")
plot(cumsum(pve), type = "o", ylab = "Cumulative PVE", xlab = "Principal Component", col = "brown3")

Finally, we use a radial basis kernel for PCA. We only compute the first three principal components to save computation and storage, and then plot the first two principal components:

kp.norrbf <- kpca(Norm, kernel = "rbfdot", kpar = list(sigma = 0.0005), features = 3)
plot(pcv(kp.norrbf), col = cellcol, pch = 20, xlab = "PC1", ylab = "PC2")

Next, we plot the first three principal components:

scatterplot3d(pcv(kp.norrbf), color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = -20)

[Captured images: interactive 3D plots of the first three RBF-kernel principal components]

Each of these kernels demonstrates the validity of the original PCA plots, in which a three-dimensional manifold was observed. With all three types of kernels, we observed the same clear separation between 4SG (red) cells and the rest of the cells, in addition to a gradient involving the other four types of cells. We must identify clusters and subclusters of cells within the larger gradient to gain a more granular understanding of the dataset.

v. Clustering Analysis

First, we need to determine the optimal number of clusters in the dataset. We plot the WSS (within groups sum of squares) corresponding to the results of k-means performed for cluster sizes 2 to 15:

wss <- c()
for (i in 2:15) 
  wss[i] <- sum(kmeans(Nor, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of clusters", ylab = "WSS")

The “elbow” appears at around 6 to 7 clusters, so the optimal number of clusters according to this informal graphical method is around 7.

We can also use the GAP statistic to find the optimal number of clusters (Tibshirani et al.):

cGap <- clusGap(Nor, kmeans, 15, B = 100, verbose = interactive())
cGap
## Clustering Gap statistic ["clusGap"].
## B=100 simulated reference sets, k = 1..15
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 7
##           logW   E.logW       gap      SE.sim
##  [1,] 10.69050 11.19288 0.5023754 0.001256963
##  [2,] 10.57500 11.14463 0.5696355 0.004323713
##  [3,] 10.37354 11.11579 0.7422496 0.001082916
##  [4,] 10.31871 11.09632 0.7776103 0.001187192
##  [5,] 10.27345 11.08967 0.8162187 0.001255677
##  [6,] 10.25050 11.08363 0.8331341 0.001236675
##  [7,] 10.23731 11.07810 0.8407954 0.001209351
##  [8,] 10.23378 11.07318 0.8393961 0.001336545
##  [9,] 10.22139 11.06867 0.8472804 0.001188872
## [10,] 10.21152 11.06481 0.8532941 0.001116645
## [11,] 10.20472 11.06107 0.8563493 0.001227750
## [12,] 10.19530 11.05769 0.8623952 0.001216860
## [13,] 10.19294 11.05475 0.8618066 0.001234045
## [14,] 10.18245 11.05186 0.8694033 0.001176497
## [15,] 10.17301 11.04917 0.8761666 0.001126556

The GAP statistic also demonstrates that the optimal number of clusters is around 7.

Thus, we run the k-means clustering algorithm on the data to find 7 clusters and inspect the centroids of the clusters (using a garish gradient from black to pink so each cluster looks distinct):

km.out <- kmeans(Nor, centers = 7, nstart = 20)
prs <- data.frame(pr.nor$x)
centroids <- km.out$centers

ggplot(prs, aes(x = PC1, y = PC2, colour = km.out$cluster, shape = celltypes)) + geom_point() + scale_colour_gradient(low="black", high="pink")

We can then compare the levels of expression for certain genes across clusters by inspecting the centroids. First, we take a look at endothelial genes, such as Erg, Sox7, Sox17, HoxB4, and Cdh5.

centroids[,c("Erg", "Sox7", "Sox17", "HoxB4", "Cdh5")]
##          Erg       Sox7      Sox17      HoxB4        Cdh5
## 1  -5.657702  -4.392338 -12.689546  -7.214614   0.4037315
## 2 -13.428199 -12.336337 -12.925395 -10.517164 -13.5222979
## 3  -2.167968  -1.781390  -3.584136  -5.778222   3.6608800
## 4 -13.913095 -12.443622  -7.587657 -13.331000 -13.9137016
## 5 -13.936139 -13.778266 -13.990680 -13.913158 -13.4762738
## 6  -5.515506  -1.696637  -5.996344  -7.278214   0.4924869
## 7 -11.050919  -3.317122  -9.406439 -11.287587  -9.4587777

The cluster with the lowest expression of endothelial genes across the board represents the highly separated 4SG cells, or Cluster III from the Moignard et al. analysis. The cluster with the highest expression of endothelial genes across the board represents mostly PS and NP cells, or Cluster I from the Moignard et al. analysis. Graphically, as we move away from Cluster I (in either direction on the manifold), the endothelial genes become less and less expressed.

Second, we take a look at hematopoietic genes, such as Hbb-bH1, Gata1, Nfe2, Gfi1b, Ikaros, and Myb.

centroids[,c("HbbbH1", "Gata1", "Nfe2", "Gfi1b", "Ikaros", "Myb")]
##       HbbbH1      Gata1       Nfe2      Gfi1b     Ikaros        Myb
## 1  -6.956291  -7.458188  -6.647620  -3.439809  -3.095237  -2.486049
## 2 -11.927792 -13.638337 -13.510728 -13.580726 -13.725458 -12.432889
## 3 -11.239576 -13.838621 -12.508002 -12.192791 -11.339907 -12.250488
## 4 -11.719080 -12.825515 -13.512701 -13.485849 -13.765449 -12.985882
## 5   6.599948  -2.732422  -1.671084  -1.748268  -2.725703  -4.126775
## 6 -11.731799 -13.645862 -12.745990 -12.862657 -10.060498  -9.166072
## 7 -11.725466 -13.656746 -13.370423 -13.309723 -13.147767 -11.074483

Here, we see clear patterns of differentiation. The cluster with the highest expression of hematopoietic genes across the board represents the highly separated 4SG cells, or Cluster III from the Moignard et al. analysis. Its neighboring cluster (on the manifold) also expresses high levels of these hematopoietic genes. On the other hand, all other clusters have little to no expression of these genes, representing the other “arm” of the manifold. This indicates that movement on the bottom “arm” sends cells towards a hematopoietic future.

Thus, the k-means clustering does indicate the correctness of the agglomerative clustering performed in Moignard et al. Below are visualizations of the clusters in three-dimensional space.

[Captured images: 3D visualizations of the k-means clusters]

vi. Association Rules Analysis

Here, we perform our association rules analysis using the “arules” package. We first load the binary data and convert it into transactional form. In other words, for each row or cell, the transaction consists of all of those genes that are expressed or “on.”

Norb <- read.csv("~/Downloads/nbt.3154-S6-binary.csv", row.names = 1)
Norbm <- as.matrix(Norb)
Norbt <- as(Norbm, "transactions")
summary(Norbt)
## transactions as itemMatrix in sparse format with
##  1448 rows (elements/itemsets/transactions) and
##  33 columns (items) and a density of 0.6188892 
## 
## most frequent items:
##    Ets2   FoxO4   FoxH1    Ldb1    Tal1 (Other) 
##    1442    1442    1438    1433    1400   22418 
## 
## element (itemset/transaction) length distribution:
## sizes
##   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26 
##   1   6   9  23  33  52  74  88  96  98 104 124 112 140 123 104  96  74 
##  27  28  29  30  31 
##  50  24  12   4   1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   17.00   21.00   20.42   24.00   31.00 
## 
## includes extended item information - examples:
##     labels
## 1 Cbfa2t3h
## 2      Erg
## 3     Ets1
## 
## includes extended transaction information - examples:
##   transactionID
## 1      HFB9_177
## 2     HFB12_136
## 3      NPA6_034

The following is a frequency of expression for each gene across the entire “transaction database,” or the cells:

itemFrequencyPlot(Norbt, support = 0.3, cex.names = 0.5)

To understand the role that Hox and Sox factors play in controlling Erg, we must first find all the rules for which Erg is the implication, or the sets of “on” genes that indicate that Erg is “on” as well. Generally, a rule finds a set of genes, denoted X, that imply a separate set of genes, denoted Y. Note the quality metrics of support and confidence. The support of X refers to the frequency with which X appears in the transaction database. The confidence of a rule refers to the proportion of transactions containing X that also contain Y. Finally, the lift of a rule is the ratio of the observed support to that expected if X and Y were independent. Thus, we develop our rules while requiring high levels of support and confidence:

r <- apriori(Norbt, control = list(verbose = F), parameter = list(minlen=2, supp = 0.2, conf = 0.7), appearance = list(default = "lhs", rhs = c("Erg")))
summary(r)
## set of 573623 rules
## 
## rule length distribution (lhs + rhs):sizes
##      2      3      4      5      6      7      8      9     10 
##      6    157   1426   7380  25516  63575 118603 169391 187569 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0     8.0     9.0     8.7    10.0    10.0 
## 
## summary of quality measures:
##     support         confidence          lift      
##  Min.   :0.2003   Min.   :0.7000   Min.   :1.247  
##  1st Qu.:0.2196   1st Qu.:0.8232   1st Qu.:1.466  
##  Median :0.2507   Median :0.8571   Median :1.527  
##  Mean   :0.2712   Mean   :0.8490   Mean   :1.512  
##  3rd Qu.:0.3018   3rd Qu.:0.8820   3rd Qu.:1.571  
##  Max.   :0.5580   Max.   :0.9327   Max.   :1.661  
## 
## mining info:
##   data ntransactions support confidence
##  Norbt          1448     0.2        0.7
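As a sanity check on these quality measures, they can be computed by hand in base R on a toy binary matrix (the gene names here are illustrative only, not taken from the real data):

```r
# Sketch (base R, toy data): support, confidence, and lift for {Sox17} => {Erg}.
toy <- rbind(c(Sox17 = 1, Erg = 1),
             c(Sox17 = 1, Erg = 1),
             c(Sox17 = 1, Erg = 0),
             c(Sox17 = 0, Erg = 1),
             c(Sox17 = 0, Erg = 0))
supp_X  <- mean(toy[, "Sox17"] == 1)                      # frequency of X
supp_XY <- mean(toy[, "Sox17"] == 1 & toy[, "Erg"] == 1)  # frequency of X and Y together
conf    <- supp_XY / supp_X                               # confidence of the rule
lift    <- conf / mean(toy[, "Erg"] == 1)                 # ratio vs. independence
c(support = supp_XY, confidence = conf, lift = lift)
# support 0.4, confidence ~0.667, lift ~1.11
```

These are the same quantities apriori reports for each mined rule, computed over the full transaction database.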

The algorithm orders the developed rules by length. Thus, we can inspect the first ten rules and see the connections between Mecom, Sox17, HoxB4, Meis1, Sox7, Notch1, and Erg:

inspect(r[1:10])
##    lhs         rhs     support confidence     lift
## 1  {Mecom}  => {Erg} 0.2389503  0.8693467 1.548357
## 2  {Sox17}  => {Erg} 0.3466851  0.8408710 1.497640
## 3  {HoxB4}  => {Erg} 0.3743094  0.8262195 1.471545
## 4  {Meis1}  => {Erg} 0.5138122  0.7315634 1.302957
## 5  {Sox7}   => {Erg} 0.5379834  0.7728175 1.376433
## 6  {Notch1} => {Erg} 0.5448895  0.7485769 1.333259
## 7  {Mecom,                                        
##     Meis1}  => {Erg} 0.2313536  0.8862434 1.578451
## 8  {Mecom,                                        
##     Sox7}   => {Erg} 0.2389503  0.8987013 1.600639
## 9  {Mecom,                                        
##     Notch1} => {Erg} 0.2375691  0.8911917 1.587264
## 10 {Etv2,                                         
##     Mecom}  => {Erg} 0.2147790  0.8687151 1.547232

We can also sort the rules by decreasing size of lift:

r.sorted <- sort(r, by = "lift")
inspect(r.sorted[1:10])
##    lhs           rhs     support confidence     lift
## 1  {Cbfa2t3h,                                       
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 2  {Cbfa2t3h,                                       
##     Ets1,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 3  {Cbfa2t3h,                                       
##     Fli1,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 4  {Cbfa2t3h,                                       
##     Etv6,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 5  {Cbfa2t3h,                                       
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1,                                         
##     Tal1}     => {Erg} 0.2009669  0.9326923 1.661179
## 6  {Cbfa2t3h,                                       
##     FoxH1,                                          
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 7  {Cbfa2t3h,                                       
##     FoxO4,                                          
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 8  {Cbfa2t3h,                                       
##     Ets2,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 9  {Cbfa2t3h,                                       
##     Ets1,                                           
##     Fli1,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 10 {Cbfa2t3h,                                       
##     Ets1,                                           
##     Etv6,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179

To see the influence of Hox and Sox factors on Erg, we subset the Erg rules to those whose left-hand side includes any of HoxB2, HoxB4, HoxD8, Sox17, or Sox7.

r.hoxSox <- subset(r, subset = (lhs %in% c("HoxB2", "HoxB4", "HoxD8", "Sox17", "Sox7")))
summary(r.hoxSox)
## set of 340898 rules
## 
## rule length distribution (lhs + rhs):sizes
##      2      3      4      5      6      7      8      9     10 
##      3     61    565   3172  12046  32828  66584 102875 122764 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    8.00    9.00    8.82   10.00   10.00 
## 
## summary of quality measures:
##     support         confidence          lift      
##  Min.   :0.2003   Min.   :0.7000   Min.   :1.247  
##  1st Qu.:0.2196   1st Qu.:0.8444   1st Qu.:1.504  
##  Median :0.2500   Median :0.8689   Median :1.548  
##  Mean   :0.2677   Mean   :0.8620   Mean   :1.535  
##  3rd Qu.:0.2983   3rd Qu.:0.8873   3rd Qu.:1.580  
##  Max.   :0.5380   Max.   :0.9327   Max.   :1.661  
## 
## mining info:
##   data ntransactions support confidence
##  Norbt          1448     0.2        0.7
inspect(r.hoxSox[1:10])
##    lhs         rhs     support confidence     lift
## 1  {Sox17}  => {Erg} 0.3466851  0.8408710 1.497640
## 2  {HoxB4}  => {Erg} 0.3743094  0.8262195 1.471545
## 3  {Sox7}   => {Erg} 0.5379834  0.7728175 1.376433
## 4  {Mecom,                                        
##     Sox7}   => {Erg} 0.2389503  0.8987013 1.600639
## 5  {HoxB4,                                        
##     Sox17}  => {Erg} 0.2368785  0.8863049 1.578560
## 6  {Lmo2,                                         
##     Sox17}  => {Erg} 0.2265193  0.8453608 1.505636
## 7  {Sox17,                                        
##     Tbx20}  => {Erg} 0.2686464  0.8155136 1.452477
## 8  {Meis1,                                        
##     Sox17}  => {Erg} 0.3301105  0.8675136 1.545092
## 9  {Sox17,                                        
##     Sox7}   => {Erg} 0.3466851  0.8451178 1.505204
## 10 {Notch1,                                       
##     Sox17}  => {Erg} 0.3459945  0.8462838 1.507280

We also examine the rules whose left-hand side includes none of the Hox or Sox factors:

r.notHoxSox <- subset(r, subset = !(lhs %in% c("HoxB2", "HoxB4", "HoxD8", "Sox17", "Sox7")))
summary(r.notHoxSox)
## set of 232725 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6     7     8     9    10 
##     3    96   861  4208 13470 30747 52019 66516 64805 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   8.000   9.000   8.524  10.000  10.000 
## 
## summary of quality measures:
##     support         confidence          lift      
##  Min.   :0.2003   Min.   :0.7000   Min.   :1.247  
##  1st Qu.:0.2189   1st Qu.:0.8005   1st Qu.:1.426  
##  Median :0.2521   Median :0.8338   Median :1.485  
##  Mean   :0.2762   Mean   :0.8300   Mean   :1.478  
##  3rd Qu.:0.3073   3rd Qu.:0.8644   3rd Qu.:1.539  
##  Max.   :0.5580   Max.   :0.9135   Max.   :1.627  
## 
## mining info:
##   data ntransactions support confidence
##  Norbt          1448     0.2        0.7

While the majority of Erg rules involve some Hox or Sox factor, their quality statistics are nearly identical to those of the rules that involve none. We therefore take a more refined approach and examine the subset of rules involving HoxB4 or Sox17, which Moignard et al. identified as key determinants of Erg.
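
The comparison of the two subsets can also be made numerically: in arules, quality() returns the support, confidence, and lift columns of a rule set as a data frame, so colMeans(quality(r.hoxSox)) and colMeans(quality(r.notHoxSox)) can be set side by side. A minimal sketch of the idea, with small stand-in data frames in place of the real quality() output:

```r
# Stand-in quality tables (two rules each); in practice these would be
# quality(r.hoxSox) and quality(r.notHoxSox) from the arules objects above.
q.hoxSox    <- data.frame(support    = c(0.25, 0.30),
                          confidence = c(0.86, 0.87),
                          lift       = c(1.55, 1.60))
q.notHoxSox <- data.frame(support    = c(0.25, 0.31),
                          confidence = c(0.83, 0.84),
                          lift       = c(1.48, 1.54))

# One column of means per subset, one row per quality measure.
sapply(list(hoxSox = q.hoxSox, notHoxSox = q.notHoxSox), colMeans)
```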

r.refined <- subset(r, subset = (lhs %in% c("HoxB4", "Sox17")))
summary(r.refined)
## set of 179071 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6     7     8     9    10 
##     2    39   349  1899  6992 18431 36045 53613 61701 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   8.000   9.000   8.767  10.000  10.000 
## 
## summary of quality measures:
##     support         confidence          lift      
##  Min.   :0.2003   Min.   :0.7919   Min.   :1.410  
##  1st Qu.:0.2169   1st Qu.:0.8609   1st Qu.:1.533  
##  Median :0.2465   Median :0.8762   Median :1.561  
##  Mean   :0.2557   Mean   :0.8760   Mean   :1.560  
##  3rd Qu.:0.2907   3rd Qu.:0.8908   3rd Qu.:1.587  
##  Max.   :0.3743   Max.   :0.9327   Max.   :1.661  
## 
## mining info:
##   data ntransactions support confidence
##  Norbt          1448     0.2        0.7

Note the rise in confidence and lift in this refined subset compared to the previous two (minimum confidence climbs from 0.70 to 0.79). We inspect the first few rules of this refined subset, then sort it by lift and inspect it again:

inspect(r.refined[1:10])
##    lhs         rhs     support confidence     lift
## 1  {Sox17}  => {Erg} 0.3466851  0.8408710 1.497640
## 2  {HoxB4}  => {Erg} 0.3743094  0.8262195 1.471545
## 3  {HoxB4,                                        
##     Sox17}  => {Erg} 0.2368785  0.8863049 1.578560
## 4  {Lmo2,                                         
##     Sox17}  => {Erg} 0.2265193  0.8453608 1.505636
## 5  {Sox17,                                        
##     Tbx20}  => {Erg} 0.2686464  0.8155136 1.452477
## 6  {Meis1,                                        
##     Sox17}  => {Erg} 0.3301105  0.8675136 1.545092
## 7  {Sox17,                                        
##     Sox7}   => {Erg} 0.3466851  0.8451178 1.505204
## 8  {Notch1,                                       
##     Sox17}  => {Erg} 0.3459945  0.8462838 1.507280
## 9  {Etv2,                                         
##     Sox17}  => {Erg} 0.3197514  0.8297491 1.477831
## 10 {Hhex,                                         
##     Sox17}  => {Erg} 0.3397790  0.8424658 1.500480
r.refined.sorted <- sort(r.refined, by = "lift")
inspect(r.refined.sorted[1:10])
##    lhs           rhs     support confidence     lift
## 1  {Cbfa2t3h,                                       
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 2  {Cbfa2t3h,                                       
##     Ets1,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 3  {Cbfa2t3h,                                       
##     Fli1,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 4  {Cbfa2t3h,                                       
##     Etv6,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 5  {Cbfa2t3h,                                       
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1,                                         
##     Tal1}     => {Erg} 0.2009669  0.9326923 1.661179
## 6  {Cbfa2t3h,                                       
##     FoxH1,                                          
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 7  {Cbfa2t3h,                                       
##     FoxO4,                                          
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 8  {Cbfa2t3h,                                       
##     Ets2,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 9  {Cbfa2t3h,                                       
##     Ets1,                                           
##     Fli1,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179
## 10 {Cbfa2t3h,                                       
##     Ets1,                                           
##     Etv6,                                           
##     Hhex,                                           
##     HoxB4,                                          
##     Ikaros,                                         
##     Lyl1,                                           
##     Notch1}   => {Erg} 0.2009669  0.9326923 1.661179

In each of the ten rules, HoxB4 appears alongside Lyl1 and Notch1, consistent with what is known developmentally about the activation of Erg via HoxB4 (Moignard et al.). We can visualize this by plotting the 25 highest-lift rules drawn from all Hox and Sox factors and comparing them to the plot of the 25 highest-lift rules drawn from HoxB4 and Sox17 alone:

r.hoxSox.sorted <- sort(r.hoxSox, by = "lift")
plot(r.hoxSox.sorted[1:25], method = "graph")

plot(r.refined.sorted[1:25], method = "graph")

We can reverse the analysis and ask what the presence of Erg implies, making Erg the left-hand side. We lower the support and confidence thresholds to find more rules, then inspect and graph the results:

rl <- apriori(Norbt, control = list(verbose = F),
              parameter = list(minlen = 2, supp = 0.1, conf = 0.5),
              appearance = list(default = "rhs", lhs = c("Erg")))
summary(rl)
## set of 23 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2 
## 23 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support         confidence          lift       
##  Min.   :0.2873   Min.   :0.5117   Min.   :0.9088  
##  1st Qu.:0.3999   1st Qu.:0.7122   1st Qu.:1.0017  
##  Median :0.5311   Median :0.9459   Median :1.1094  
##  Mean   :0.4832   Mean   :0.8606   Mean   :1.1545  
##  3rd Qu.:0.5566   3rd Qu.:0.9914   3rd Qu.:1.2420  
##  Max.   :0.5615   Max.   :1.0000   Max.   :1.4976  
## 
## mining info:
##   data ntransactions support confidence
##  Norbt          1448     0.1        0.5
inspect(rl)
##    lhs      rhs          support confidence      lift
## 1  {Erg} => {Sox17}    0.3466851  0.6174662 1.4976399
## 2  {Erg} => {HoxB4}    0.3743094  0.6666667 1.4715447
## 3  {Erg} => {Sfpi1}    0.3425414  0.6100861 1.2389967
## 4  {Erg} => {Myb}      0.2872928  0.5116851 0.9319749
## 5  {Erg} => {Ikaros}   0.2914365  0.5190652 0.9088348
## 6  {Erg} => {Lmo2}     0.3736188  0.6654367 1.2104928
## 7  {Erg} => {Tbx20}    0.4254144  0.7576876 1.2231122
## 8  {Erg} => {Meis1}    0.5138122  0.9151292 1.3029567
## 9  {Erg} => {Sox7}     0.5379834  0.9581796 1.3764326
## 10 {Erg} => {Notch1}   0.5448895  0.9704797 1.3332586
## 11 {Erg} => {Etv2}     0.5214088  0.9286593 1.2450913
## 12 {Erg} => {Hhex}     0.5469613  0.9741697 1.2255411
## 13 {Erg} => {Ets1}     0.5580110  0.9938499 1.2052720
## 14 {Erg} => {Runx1}    0.4779006  0.8511685 0.9971618
## 15 {Erg} => {Lyl1}     0.5283149  0.9409594 1.0996846
## 16 {Erg} => {Cbfa2t3h} 0.5310773  0.9458795 1.1001072
## 17 {Erg} => {Fli1}     0.5600829  0.9975400 1.1093993
## 18 {Erg} => {Etv6}     0.5614641  1.0000000 1.0454874
## 19 {Erg} => {Tal1}     0.5614641  1.0000000 1.0342857
## 20 {Erg} => {Ldb1}     0.5552486  0.9889299 0.9992816
## 21 {Erg} => {FoxH1}    0.5545580  0.9876999 0.9945684
## 22 {Erg} => {FoxO4}    0.5580110  0.9938499 0.9979852
## 23 {Erg} => {Ets2}     0.5614641  1.0000000 1.0041609
plot(rl, method = "graph")

While the rules implying Sox17 and HoxB4 have lower support than the others (those genes are less frequent in the transaction database), their lift is the highest: a cell expressing Erg is markedly more likely than chance to also express Sox17 (and similarly HoxB4). The inspection also demonstrates a tight relationship to Sox7. No other relationship comes close.
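
The two directions of the Sox17/Erg rule are consistent with each other: both share the same support, and because lift equals supp / (P(lhs) * P(rhs)) it is symmetric, which is why {Erg} => {Sox17} reports the same 1.4976 lift that {Sox17} => {Erg} did earlier. A quick base-R check using the numbers from the output above:

```r
# Values copied from the inspect() output.
supp     <- 0.3466851  # supp({Sox17, Erg}), shared by both rule directions
conf.fwd <- 0.8408710  # conf({Sox17} => {Erg})
conf.rev <- 0.6174662  # conf({Erg} => {Sox17})

p.erg   <- supp / conf.rev  # P(Erg),   about 0.5615
p.sox17 <- supp / conf.fwd  # P(Sox17), about 0.4123

# Symmetric lift: the same value whichever side is the antecedent.
lift <- supp / (p.erg * p.sox17)
round(lift, 4)  # 1.4976
```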

We can also mine all possible rules from our transaction data, subject to constraints on support and confidence:

rules <- apriori(Norbt, parameter = list(support = 0.2, confidence = 0.6))
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.6    0.1    1 none FALSE            TRUE     0.2      1     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[32 item(s), 1448 transaction(s)] done [0.00s].
## sorting and recoding items ... [30 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [7.54s].
## writing ... [14516733 rule(s)] done [2.36s].
## creating S4 object  ... done [7.00s].
summary(rules)
## set of 14516733 rules
## 
## rule length distribution (lhs + rhs):sizes
##       1       2       3       4       5       6       7       8       9 
##      17     524    6555   46443  213911  693724 1660239 3018109 4238681 
##      10 
## 4638530 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   8.000   9.000   8.662  10.000  10.000 
## 
## summary of quality measures:
##     support         confidence          lift       
##  Min.   :0.2003   Min.   :0.6000   Min.   :0.7379  
##  1st Qu.:0.2196   1st Qu.:0.9052   1st Qu.:1.0189  
##  Median :0.2514   Median :0.9812   Median :1.1240  
##  Mean   :0.2752   Mean   :0.9301   Mean   :1.1740  
##  3rd Qu.:0.3052   3rd Qu.:0.9981   3rd Qu.:1.2775  
##  Max.   :0.9959   Max.   :1.0000   Max.   :2.3498  
## 
## mining info:
##   data ntransactions support confidence
##  Norbt          1448     0.2        0.6

We sort our rules by lift and graph the first thousand in a grouped matrix:

rules.sorted <- sort(rules, by = "lift")
plot(rules.sorted[1:1000], method = "grouped")

The plot reveals a tight relationship between Gata1 and Gfi1b, reflecting the fact that the two factors form a complex “to control expression of genes involved in the development and maturation of erythrocytes and megakaryocytes” (Osawa et al.).

6. References

Chen, Shu-Chuan, Tsung-Hsien Tsai, Cheng-Han Chung, and Wen-Hsiung Li. “Dynamic Association Rules for Gene Expression Data Analysis.” BMC Genomics 16, no. 1 (2015): 786. doi:10.1186/s12864-015-1970-x.

Hastie, Trevor, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer Series in Statistics. New York, NY: Springer, 2009.

van der Maaten, L.J.P., E.O. Postma, and H.J. van den Herik. “Dimensionality Reduction: A Comparative Review.” Online preprint, 2008.

Moignard, Victoria, Steven Woodhouse, Laleh Haghverdi, Andrew J Lilly, Yosuke Tanaka, Adam C Wilkinson, Florian Buettner, et al. “Decoding the Regulatory Network of Early Blood Development from Single-Cell Gene Expression Measurements.” Nature Biotechnology 33, no. 3 (February 9, 2015): 269–76. doi:10.1038/nbt.3154.

Osawa, Mitsujiro, Tomoyuki Yamaguchi, Yukio Nakamura, Shin Kaneko, Masafumi Onodera, Ken-Ichi Sawada, Armin Jegalian, Hong Wu, Hiromitsu Nakauchi, and Atsushi Iwama. “Erythroid Expansion Mediated by the Gfi-1B Zinc Finger Protein: Role in Normal Hematopoiesis.” Blood 100, no. 8 (October 15, 2002): 2769–77. doi:10.1182/blood-2002-01-0182.

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, no. 2 (May 2001): 411–23. doi:10.1111/1467-9868.00293.